[Performance] Sequential onloading#1263

Merged
dsikka merged 43 commits intomainfrom
kylesayrs/sequential-onloading
Jun 17, 2025

Conversation

@kylesayrs
Collaborator

@kylesayrs kylesayrs commented Mar 18, 2025

Sequential Onloading

Screenshot 2025-06-05 at 22 53 01

```
(25/33): Calibrating:   0%|                                                                                                                                  | 0/512 [00:00<?, ?it/s]
<class 'transformers.models.llama.modeling_llama.LlamaRMSNorm'>.weight -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight_scale -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight_zero_point -> cuda
...
(25/33): Calibrating: 100%|█████| 512/512 [00:23<00:00, 21.91it/s]
2025-06-03T17:29:15.536963-0400 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 512 samples
2025-06-03T17:29:17.328720-0400 | compress | METRIC - time 1.79s
2025-06-03T17:29:17.329265-0400 | compress | METRIC - error 8948.54
2025-06-03T17:29:17.329781-0400 | compress | METRIC - GPU 0 | usage: 5.41% | total memory: 85 GB
2025-06-03T17:29:17.330248-0400 | compress | METRIC - Compressed module size: 33.947648 MB
...
(25/33): Propagating: 100%|█████| 512/512 [00:03<00:00, 131.16it/s]
<class 'transformers.models.llama.modeling_llama.LlamaRMSNorm'>.weight -> meta
<class 'torch.nn.modules.linear.Linear'>.weight -> meta
<class 'torch.nn.modules.linear.Linear'>.weight_scale -> meta
<class 'torch.nn.modules.linear.Linear'>.weight_zero_point -> meta
...
```

Purpose

  • Reduce hardware requirements for calibrating large models
  • Reduce runtime caused by excess device movement when calibrating offloaded models

Prerequisites

  • vllm-project/compressed-tensors#354
  • vllm-project/compressed-tensors#355
  • vllm-project/compressed-tensors#356
  • vllm-project/compressed-tensors#357

Related Issues

  • Resolves #1383
  • Resolves #1228
  • Resolves #1122
  • Resolves #1078
  • Resolves #1216
  • Resolves #1483

Changes

  • Keep layer parameters onloaded during the entire sequential calibration + compression + propagation step
    • This is achieved through the keep_onload_context, which disables offloading until the context is exited
  • Dispatch model within each calibration pipeline
    • The sequential pipeline offloads the model to CPU and executes on the first CUDA device
  • Use the sequential pipeline as the default pipeline (the basic pipeline is never used)
    • Deprecate passing sequential_targets via modifiers; instead, prefer passing it as a oneshot argument
  • Dispatch model before sample generation
    • The model is dispatched exactly as it would be if it were loaded with device_map="auto"
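The layer-by-layer flow described above can be sketched as a toy loop. This is a hedged illustration only: the names onload, offload, and keep_onloaded are hypothetical stand-ins, not the actual llm-compressor API, and no real devices are involved.

```python
from contextlib import contextmanager

# Event log so we can observe the order of device movements (toy only).
events = []

def onload(layer):
    # Stand-in for moving one layer's parameters onto the execution device.
    events.append(("onload", layer))

def offload(layer):
    # Stand-in for moving the layer's parameters back off-device.
    events.append(("offload", layer))

@contextmanager
def keep_onloaded(layer):
    # Keep one layer's parameters on the execution device for the whole
    # calibrate -> compress -> propagate step; offload only on exit.
    onload(layer)
    try:
        yield
    finally:
        offload(layer)

def run_sequential(layers):
    for layer in layers:
        with keep_onloaded(layer):
            events.append(("calibrate", layer))
            events.append(("compress", layer))
            events.append(("propagate", layer))

run_sequential(["layers.0", "layers.1"])
```

Each layer is moved on and off the device exactly once, rather than once per calibration/compression/propagation sub-step.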

Examples

  • Models are loaded onto CPU before oneshot (rather than being dispatched across GPUs)
  • The model is reloaded from disk in order to redispatch onto the "auto" device map
    • In my opinion, this is a better flow anyway, since models can raise errors / take a very long time during generation, which can cause the entire compression job to go to waste
    • The alternative is to either call accelerate.remove_hooks(model) and accelerate.dispatch_model(model) before generating, or get rid of sample generation entirely. One of these may be required if compressed_linear isn't reliable enough to add to our examples
New example script

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.utils.dev import dispatch_for_generation

# Load model (on cpu)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")  # model is loaded on cpu
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define recipe
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply oneshot (model execution device is set to cuda, model stays on cpu)
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Perform sample generation
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk in compressed format
SAVE_DIR = model_id.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Testing

  • Calibrated and GPTQ-compressed one layer of Deepseek-V3 with a single H100 in 50 seconds
    • 4.5x improvement over the original 236 seconds
    • Peak memory of ~40 GB, which can be further reduced by increasing the granularity of sequential targets
  • Not offloading activations did not result in a performance improvement
  • TODO: Test that all example models can be reloaded and run
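Per the deprecation noted in the Changes above, the "granularity of sequential targets" knob would be controlled through the oneshot argument rather than through modifiers. A hedged fragment, reusing the model/recipe from the example script; the target module class names here are hypothetical:

```python
# Hypothetical: finer-grained sequential targets lower peak memory by
# onloading smaller units at a time (argument name as described in this PR).
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    sequential_targets=["LlamaAttention", "LlamaMLP"],  # instead of whole decoder layers
    max_seq_length=2048,
    num_calibration_samples=512,
)
```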

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs added the ready When a PR is ready for review label Mar 18, 2025
@kylesayrs kylesayrs self-assigned this Mar 18, 2025
@brian-dellabetta brian-dellabetta self-requested a review March 18, 2025 14:17
Collaborator

@brian-dellabetta brian-dellabetta left a comment


sorry, i approved this thinking it was the one-liner removing clear-ml, will have to take a closer look

@brian-dellabetta brian-dellabetta dismissed their stale review March 18, 2025 14:20


Collaborator

@brian-dellabetta brian-dellabetta left a comment


I am understanding this for the most part -- very cool!

@kylesayrs
Collaborator Author

```python
@contextmanager
def DoNotOffloadContext():
    to_offload = set()

    def patched(hook, module, output=None):
        to_offload.add(module)  # record instead of offloading now
        return output

    with patch_attr(AlignDevicesHook, "post_forward", patched):
        yield

    # offload on exit
    for module in to_offload:
        module._hf_hook.post_forward(module, None)


for subgraph in subgraphs():
    with DoNotOffloadContext():
        subgraph(**inputs)
```
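The deferred-offload idea in the sketch above can be exercised with plain-Python stand-ins. Here Hook is a toy substitute for accelerate's AlignDevicesHook, keep_onload_context is hypothetical, and no real devices or accelerate imports are involved:

```python
from contextlib import contextmanager
from unittest.mock import patch

class Hook:
    """Toy stand-in for accelerate's AlignDevicesHook."""

    def __init__(self):
        self.offloaded = []

    def post_forward(self, module, output=None):
        # In the real hook, this would move the module's weights off-device.
        self.offloaded.append(module)
        return output

hook = Hook()

@contextmanager
def keep_onload_context():
    to_offload = []

    def record_only(self, module, output=None):
        if module not in to_offload:
            to_offload.append(module)  # defer the offload instead of doing it now
        return output

    # While the context is active, post_forward only records modules.
    with patch.object(Hook, "post_forward", record_only):
        yield
    for module in to_offload:
        hook.post_forward(module)  # offload once, after the context exits

with keep_onload_context():
    hook.post_forward("model.layers.24")  # called once per forward pass...
    hook.post_forward("model.layers.24")  # ...but the offload is deferred

assert hook.offloaded == ["model.layers.24"]  # offloaded exactly once, on exit
```

Repeated forward passes inside the context no longer trigger repeated device movement; the single offload happens when the sequential step finishes.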

@kylesayrs kylesayrs force-pushed the kylesayrs/sequential-onloading branch from 382c3e6 to 7586733 Compare June 3, 2025 21:31
@kylesayrs kylesayrs removed the ready When a PR is ready for review label Jun 3, 2025
@kylesayrs kylesayrs added the ready When a PR is ready for review label Jun 16, 2025
Collaborator

@brian-dellabetta brian-dellabetta left a comment


awesome stuff!


Collaborator

@brian-dellabetta brian-dellabetta left a comment


pretty sweet that we're able to remove ~500 lines of code while adding a huge feature like this 🔥

Screenshot 2025-06-17 at 1 50 41 PM

Collaborator

@dsikka dsikka left a comment


Great work

@dsikka dsikka enabled auto-merge (squash) June 17, 2025 19:58
@dsikka dsikka merged commit f4e484d into main Jun 17, 2025
18 checks passed
@dsikka dsikka deleted the kylesayrs/sequential-onloading branch June 17, 2025 20:45
dsikka added a commit that referenced this pull request Jun 25, 2025
## Purpose ##
* Speed up tests by reducing device movement

## Background ##
As of #1263, the model is dispatched to different device maps depending
on which pipelines are used. If the model starts on anything but the
CPU, then these dispatches and undispatches create device movement.
Starting on the CPU will ensure no device movement occurs when offloaded
dispatches happen.

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
kylesayrs added a commit that referenced this pull request Jun 30, 2025
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025
…ssed (vllm-project#1530)

## Purpose ##
* Fix failing examples

## Changes ##
* Save model after generation in all examples
* Previously, models were saved before generation, causing generation to
fail because we do not yet fully support generating with compressed
models

## Future ##
* In the future, we can define a better API around compressing and
decompressing models which does not require so many arguments
* In the future, we can standardize around reloading (and redispatching)
the model before generation, as suggested here vllm-project#1263
* In the future, we can remove the sample generation step

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025
aireilly pushed a commit to aireilly/llm-compressor that referenced this pull request Jul 30, 2025